In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.
['input_data']
In [2]:
from tqdm import tqdm
from statsmodels.graphics.gofplots import qqplot

1.0 Data files available

In [3]:
ls '../input/input_data/'
drive/  trip/  vehicle.csv  weather/
  • 3 folders are provided, each for drive, trip and weather data; these folders have multiple parquet files which will need to be contacted togther to yield 1 consolidated dataframe for each source
  • In this analysis, we will focus on Drive features' eda analysis

2.0 Read drive data

In [4]:
### get list of all the files for drive
driveF = [i for i in os.listdir('../input/input_data/drive/') if 'parquet' in i]
print("Total {} partial files for drive features".format(len(driveF)))
Total 43 partial files for drive features
In [5]:
### lets build one dataframe each for trip, drive, and weather
Path = '../input/input_data/'

def consolidateFiles(sourceType:str,iterfiles:list):
    print(" ------  Consolidating for {} ------- ".format(sourceType))
    outDF = pd.DataFrame()
    for f_ in tqdm(iterfiles):
        outDF = pd.concat([outDF,pd.read_parquet(os.path.join(Path,'{}/{}'.format(sourceType,f_)))],0)
    return outDF
driveDF = consolidateFiles('drive',driveF)
print("All Drive files read & consolidated")
  0%|          | 0/43 [00:00<?, ?it/s]
 ------  Consolidating for drive ------- 
100%|██████████| 43/43 [00:25<00:00,  1.05s/it]
All Drive files read & consolidated

In [6]:
### Quick look into columns
driveDF.head(2)
Out[6]:
vehicle_id trip_id datetime velocity accel_x accel_y accel_z engine_coolant_temp eng_load fuel_level iat rpm
0 1000512 861170e5f30342c78d9b706ae908dc4f 2017-01-06 21:00:00 0.00 87.39 78.39 49.0 145.13 216.57 87.0 71.0 2027.39
1 1000512 861170e5f30342c78d9b706ae908dc4f 2017-01-06 21:00:01 45.59 83.93 77.46 46.0 149.33 227.81 89.0 79.0 2028.45
  1. Univariate analysis
  2. Multivariate analysis
In [7]:
### divide into contin and categorical cols
numericCol = driveDF.select_dtypes(float).columns
others = [i for i in driveDF.columns if i not in numericCol]

Univariate Analysis of numeric variables

In [8]:
import seaborn as sns
import matplotlib.pyplot as plt
In [9]:
for n_ in numericCol:
    sns.distplot([driveDF[n_]])
    plt.title("Distribution for {}".format(n_))
    plt.show()
In [10]:
### Also plot boxplots for outliers
for n_ in numericCol:
    sns.boxplot([driveDF[n_]])
    plt.title("Boxplot for {}".format(n_))
    plt.show()
  • Velocity is normally distributed with a mean($\mu$) ~ 64and variance($\sigma^2$) ~ 175
  • Accel_X,Accel_Y,Accel_Z interestingly all have a bi-modal distribution comprised of almost 2 normal distributions
  • Engine coolant temp is also bi-modal distribution
  • Engine load is approximately normally distributed wih ($\mu$) ~ 204and variance($\sigma^2$) ~ 100
  • Fuel level and rpm do not look like any parametric distribution
  • Velocity and Engine load have a lot of extreme values (>1.5 IQR and < -1.5 IQR)

Deep dive univariate analysis for Accel X,Y,Z, EngLoad, Fuel Level and RPM

In [11]:
### Univariate Analysis for accel_X
sns.distplot(driveDF['accel_x'])
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f26fe89c278>

There are 2 broad behaviours depicted in the above:

  • While driving within city limits and person will only accelarate closer to 50 ish (~ mean of left Gaussian)
  • While outside of city, say onto highways travelling will be slightly higher accelaration (mean ~ 80 for right Gaussain)
  • In terms of feature engg, it might be interesting to also provide info about whether the data is from left or right Gaussian; perhaps even encoding the respective sides' mean in some form might help
In [12]:
### also check if the behavior is consistent for all three directional acceleration elements
In [13]:
### Univariate Analysis for accel_X
sns.distplot(driveDF['accel_x'])
sns.distplot(driveDF['accel_y'])
sns.distplot(driveDF['accel_z'])
plt.plot()
;
Out[13]:
''

We can conclude that there is consistency in terms of directional acceleration elements and they seem to follow similar bi-modal distributions

Eng load

In [14]:
sns.distplot(driveDF['eng_load'])
print("Mean: {0:.1f} Variance: {0:.1f}".format(driveDF['eng_load'].mean(),driveDF['eng_load'].var()))
Mean: 204.5 Variance: 204.5
In [15]:
### confirm normality by QQ plots
qqplot(driveDF['eng_load'])
Out[15]:
In [16]:
from scipy.stats import shapiro
stat, p = shapiro(driveDF['eng_load'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
Statistics=0.276, p=0.000
Sample does not look Gaussian (reject H0)
/opt/conda/lib/python3.6/site-packages/scipy/stats/morestats.py:1660: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")
  • To conclude, the variable engine load although looks Gaussian, but from a statical standpoint is not normal.

Fuel Level

In [17]:
sns.distplot(driveDF['fuel_level'])
print("Mean: {0:.1f} Variance: {0:.1f}".format(driveDF['fuel_level'].mean(),driveDF['fuel_level'].var()))
Mean: 119.9 Variance: 119.9
  • Interestingly, both mean and variance are at 119.9
  • It could be poisson distribution as mean=variance, but on the other hand fuel-level is not a count variable, so not entirely sure.
  • Uneven distribution with single peak and numerous smaller peaks

RPM

In [18]:
sns.distplot(driveDF['rpm'])
print("Mean: {0:.1f} Variance: {0:.1f}".format(driveDF['rpm'].mean(),driveDF['rpm'].var()))
Mean: 2049.6 Variance: 2049.6
  • Interestingly, both mean and variance are at 2049.6
  • Definitely poisson distribution since rpm is count of rotations per min (and also follows property of poisson dist mean=var).
  • Uneven distribution with single peak and numerous smaller peaks

Univariate Analysis of categorical variables

In [19]:
others
Out[19]:
['vehicle_id', 'trip_id', 'datetime']

Vehicle_ID

In [20]:
print("Count of unique vehicles: {}".format(len(driveDF['vehicle_id'].unique())))
plt.figure(figsize=(20,5))
sns.countplot(driveDF['vehicle_id'])
Count of unique vehicles: 20
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f26fe586dd8>
  • 20 unique vehicle IDs
  • Almost uniform distribution is terms of counts; some vehiles have lower counts but seem fine more or mess

Trip ID

In [21]:
print("Count of unique trips: {}".format(len(driveDF['trip_id'].unique())))
plt.figure(figsize=(20,5))
sns.countplot(driveDF['trip_id'])
Count of unique trips: 1708
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f26fe586b00>
In [22]:
sns.distplot(driveDF['trip_id'].value_counts().values,kde=False)
plt.title("Countplot for trips")
Out[22]:
Text(0.5, 1.0, 'Countplot for trips')
  • 1788 unique vehicle IDs
  • High variation in terms of counts of trip, but the count distribution is more of less uniform, meaning there are almost equal number of short, medium and long trips in the data

Multivariate Analysis

Velocity:

Distribution of velocity by vehicles, trips,times

In [23]:
for v_ in driveDF['vehicle_id'].unique():
    d = driveDF[driveDF['vehicle_id']==v_]
    sns.distplot(d['velocity'])

We can see that on average almost all vehicles speed is normally distributed around mean of 60 and has very similar variance across vehicles

In [24]:
### analysis by top 10 longest and shortest trips
top10shortTrips = driveDF['trip_id'].value_counts()[-10:].index
top10longTrips = driveDF['trip_id'].value_counts()[:10].index
In [25]:
for d_ in top10longTrips:
    d = driveDF[driveDF['trip_id']==d_]
    sns.distplot(d['velocity'])
In [26]:
for d_ in top10shortTrips:
    d = driveDF[driveDF['trip_id']==d_]
    sns.distplot(d['velocity'])
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-26-337796621bc8> in <module>
      1 for d_ in top10shortTrips:
      2     d = driveDF[driveDF['trip_id']==d_]
----> 3     sns.distplot(d['velocity'])

/opt/conda/lib/python3.6/site-packages/seaborn/distributions.py in distplot(a, bins, hist, kde, rug, fit, hist_kws, kde_kws, rug_kws, fit_kws, color, vertical, norm_hist, axlabel, label, ax)
    229     if kde:
    230         kde_color = kde_kws.pop("color", color)
--> 231         kdeplot(a, vertical=vertical, ax=ax, color=kde_color, **kde_kws)
    232         if kde_color != color:
    233             kde_kws["color"] = kde_color

/opt/conda/lib/python3.6/site-packages/seaborn/distributions.py in kdeplot(data, data2, shade, vertical, kernel, bw, gridsize, cut, clip, legend, cumulative, shade_lowest, cbar, cbar_ax, cbar_kws, ax, **kwargs)
    689         ax = _univariate_kdeplot(data, shade, vertical, kernel, bw,
    690                                  gridsize, cut, clip, legend, ax,
--> 691                                  cumulative=cumulative, **kwargs)
    692 
    693     return ax

/opt/conda/lib/python3.6/site-packages/seaborn/distributions.py in _univariate_kdeplot(data, shade, vertical, kernel, bw, gridsize, cut, clip, legend, ax, cumulative, **kwargs)
    292                               "only implemented in statsmodels."
    293                               "Please install statsmodels.")
--> 294         x, y = _scipy_univariate_kde(data, bw, gridsize, cut, clip)
    295 
    296     # Make sure the density is nonnegative

/opt/conda/lib/python3.6/site-packages/seaborn/distributions.py in _scipy_univariate_kde(data, bw, gridsize, cut, clip)
    364     """Compute a univariate kernel density estimate using scipy."""
    365     try:
--> 366         kde = stats.gaussian_kde(data, bw_method=bw)
    367     except TypeError:
    368         kde = stats.gaussian_kde(data)

/opt/conda/lib/python3.6/site-packages/scipy/stats/kde.py in __init__(self, dataset, bw_method, weights)
    206             self._neff = 1/sum(self._weights**2)
    207 
--> 208         self.set_bandwidth(bw_method=bw_method)
    209 
    210     def evaluate(self, points):

/opt/conda/lib/python3.6/site-packages/scipy/stats/kde.py in set_bandwidth(self, bw_method)
    552             raise ValueError(msg)
    553 
--> 554         self._compute_covariance()
    555 
    556     def _compute_covariance(self):

/opt/conda/lib/python3.6/site-packages/scipy/stats/kde.py in _compute_covariance(self)
    564                                                bias=False,
    565                                                aweights=self.weights))
--> 566             self._data_inv_cov = linalg.inv(self._data_covariance)
    567 
    568         self.covariance = self._data_covariance * self.factor**2

/opt/conda/lib/python3.6/site-packages/scipy/linalg/basic.py in inv(a, overwrite_a, check_finite)
    972         inv_a, info = getri(lu, piv, lwork=lwork, overwrite_lu=1)
    973     if info > 0:
--> 974         raise LinAlgError("singular matrix")
    975     if info < 0:
    976         raise ValueError('illegal value in %d-th argument of internal '

LinAlgError: singular matrix
  • As expected, longer trips generally have stable velocity distribution mostly normal distribution (with mean around 50-80)
  • Shorter trips seem to spend a lot of time around low velocity values, but generally peak around 55 ish.
In [27]:
### analysis of velocity by hours
uqHour = driveDF['datetime'].dt.hour.unique()
plt.figure(figsize=(20,5))
for h_ in uqHour:
    d = driveDF[driveDF['datetime'].dt.hour==h_]
    sns.distplot(d['velocity'])
    plt.title('Distribution of velocty across hours')
In [28]:
for h_ in np.sort(uqHour):
    print(h_,driveDF[driveDF['datetime'].dt.hour==h_]['velocity'].mean(),driveDF[driveDF['datetime'].dt.hour==h_]['velocity'].var())
0 62.828340221507524 162.5876328564705
1 62.9273121416194 169.22254140804327
2 63.82647143102528 161.80035364076593
3 63.804105930351874 163.18256553815996
4 63.4348056732354 178.94702644570233
5 64.54915625449173 160.88391053577575
6 64.65018438535488 174.14197522745926
7 64.99542625081553 176.1721920172477
8 64.83599104971232 176.86612454582837
9 64.7598423921381 171.3348353927419
10 65.55367019467207 164.0426946929499
11 64.59642408184018 164.0957866669961
12 64.09034826186185 180.27271196668735
13 64.54282586109073 182.22160036691147
14 63.30947419813814 173.75045105700562
15 64.53816159610051 179.7365297604939
16 65.1218428550299 167.49242709739278
17 64.89142690555363 178.66282802071325
18 65.54553489233783 177.222732308641
19 64.05247079474796 182.36953508043692
20 64.09793106678575 184.24498686198615
21 65.20456703136097 182.9035259174711
22 65.01621053193031 178.61706159326292
23 63.06179717237424 171.75964163725033
  • Almost similar normal distribution for velocity across hours
  • Around midnight and early morning hours generally see lower speeds on avg with avg speed rising as day moves on
In [29]:
### velocity X accel
consolAccel = (driveDF['accel_x']**2+driveDF['accel_y']**2+driveDF['accel_z']**2)**(0.5)
print('correlation b/w veloctiy and consolidated acceleration: {:0.2f}%'.format(np.corrcoef(driveDF['velocity'],consolAccel)[1][0]*100))
correlation b/w veloctiy and consolidated acceleration: 0.41%
In [30]:
### correlation b/w velocity,temp, engLoad, fuel, rpm
In [31]:
driveDF[['velocity','engine_coolant_temp','eng_load','fuel_level','rpm']].corr()
Out[31]:
velocity engine_coolant_temp eng_load fuel_level rpm
velocity 1.000000 -0.007641 0.007363 -0.014962 -0.013565
engine_coolant_temp -0.007641 1.000000 0.006794 0.009182 0.012996
eng_load 0.007363 0.006794 1.000000 -0.004318 0.007677
fuel_level -0.014962 0.009182 -0.004318 1.000000 0.029737
rpm -0.013565 0.012996 0.007677 0.029737 1.000000

There does not seem to be much correlation b/w velocity,temp, engLoad,fuel_level or rpm

Multi-variate analysis for temperature

Across vehicles

In [32]:
for v_ in driveDF['vehicle_id'].unique():
    d = driveDF[driveDF['vehicle_id']==v_]
    sns.distplot(d['engine_coolant_temp'])
    plt.show()
In [33]:
for v_ in driveDF['vehicle_id'].unique():
    d = driveDF[driveDF['vehicle_id']==v_]
    sns.distplot(d['engine_coolant_temp'])
    plt.title("Coolant temp distribution across vehicles")
  • Generally, across vehicles, coolant temp is a multi-modal distribution with either 2 or 3 peaks

Across trips

In [34]:
for d_ in top10longTrips:
    d = driveDF[driveDF['trip_id']==d_]
    sns.distplot(d['engine_coolant_temp'])
    plt.title("Coolant temp distribution across 10 longest trips")
    i += 1
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-34-13785e5058c6> in <module>
      3     sns.distplot(d['engine_coolant_temp'])
      4     plt.title("Coolant temp distribution across 10 longest trips")
----> 5     i += 1

NameError: name 'i' is not defined
In [35]:
for d_ in top10shortTrips:
    d = driveDF[driveDF['trip_id']==d_]
    sns.distplot(d['engine_coolant_temp'])
    plt.title("Coolant temp distribution across 10 shortest trips")
  • Across longer trips the coolant temp follows a normal distribution
  • Shorter trips seem to have a wider distribution in terms of std.dev, however, some short trips do have steeper peaks

Across hour of day

In [36]:
### analysis of velocity by hours
uqHour = np.sort(driveDF['datetime'].dt.hour.unique())
for h_ in uqHour:
    d = driveDF[driveDF['datetime'].dt.hour==h_]
    plt.figure(figsize=(20,5))
    sns.distplot(d['engine_coolant_temp'])
    plt.title('Distribution of coolant temp across hour: {}'.format(h_))
    plt.show()
In [37]:
### analysis of velocity by hours
uqHour = driveDF['datetime'].dt.hour.unique()
plt.figure(figsize=(20,5))
for h_ in uqHour:
    d = driveDF[driveDF['datetime'].dt.hour==h_]
    sns.distplot(d['engine_coolant_temp'])
    plt.title('Distribution of coolant temp across hours')
  • As expected, during daytime the average coolant temp is around 160 ish, while it is lowest during night and eary morning hours
  • Multi-modal distribution with mostly 2 or 3 peaks

Multi-variate analysis for engine load

Across vehicles

In [38]:
for v_ in driveDF['vehicle_id'].unique():
    d = driveDF[driveDF['vehicle_id']==v_]
    sns.distplot(d['eng_load'])
    plt.show()
In [39]:
for v_ in driveDF['vehicle_id'].unique():
    d = driveDF[driveDF['vehicle_id']==v_]
    sns.distplot(d['eng_load'])
    plt.title("Engine Load distribution across vehicles")
  • Generally, across vehicles, coolant temp is a normal distribution with mean around 200

Across trips

In [40]:
for d_ in top10longTrips:
    d = driveDF[driveDF['trip_id']==d_]
    sns.distplot(d['eng_load'])
    plt.title("Engine Load distribution across 10 longest trips")
    i += 1
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-40-fb4754dcb83a> in <module>
      3     sns.distplot(d['eng_load'])
      4     plt.title("Engine Load distribution across 10 longest trips")
----> 5     i += 1

NameError: name 'i' is not defined
In [41]:
for d_ in top10shortTrips:
    d = driveDF[driveDF['trip_id']==d_]
    sns.distplot(d['eng_load'])
    plt.title("Engine Load distribution across 10 shortest trips")
  • Across longer trips the coolant temp follows a normal distribution
  • Shorter trips seem to have a wider distribution in terms of std.dev, however, some short trips do have steeper peaks

Across hour of day

In [42]:
### analysis of velocity by hours
uqHour = np.sort(driveDF['datetime'].dt.hour.unique())
for h_ in uqHour:
    d = driveDF[driveDF['datetime'].dt.hour==h_]
    plt.figure(figsize=(20,5))
    sns.distplot(d['eng_load'])
    plt.title('Distribution of engine load across hour: {}'.format(h_))
    plt.show()
In [43]:
### analysis of velocity by hours
uqHour = driveDF['datetime'].dt.hour.unique()
plt.figure(figsize=(20,5))
for h_ in uqHour:
    d = driveDF[driveDF['datetime'].dt.hour==h_]
    sns.distplot(d['eng_load'])
    plt.title('Distribution of engine load across hours')
  • Similar distribution of engine load acros hours

Multi-variate analysis for rpm

Across vehicles

In [44]:
for v_ in driveDF['vehicle_id'].unique():
    d = driveDF[driveDF['vehicle_id']==v_]
    sns.distplot(d['rpm'])
    plt.show()
In [45]:
for v_ in driveDF['vehicle_id'].unique():
    d = driveDF[driveDF['vehicle_id']==v_]
    sns.distplot(d['rpm'])
    plt.title("RPM distribution across vehicles")
  • Generally, across vehicles, rpm is a multimodal distribution with many peaks

Across trips

In [46]:
for d_ in top10longTrips:
    d = driveDF[driveDF['trip_id']==d_]
    sns.distplot(d['rpm'])
    plt.title("RPM distribution across 10 longest trips")
In [47]:
for d_ in top10shortTrips:
    d = driveDF[driveDF['trip_id']==d_]
    sns.distplot(d['rpm'])
    plt.title("RPM distribution across 10 shortest trips")
  • Across trips the coolant temp follows a normal distribution

Across hour of day

In [48]:
### analysis of velocity by hours
uqHour = np.sort(driveDF['datetime'].dt.hour.unique())
for h_ in uqHour:
    d = driveDF[driveDF['datetime'].dt.hour==h_]
    plt.figure(figsize=(20,5))
    sns.distplot(d['rpm'])
    plt.title('Distribution of RPM across hour: {}'.format(h_))
    plt.show()
In [49]:
### analysis of velocity by hours
uqHour = driveDF['datetime'].dt.hour.unique()
plt.figure(figsize=(20,5))
for h_ in uqHour:
    d = driveDF[driveDF['datetime'].dt.hour==h_]
    sns.distplot(d['rpm'])
    plt.title('Distribution of RPM across hours')
  • No specific pattern

End of analysis for Drive features